Abstract. We consider a natural generalization of the classical pattern
matching problem: given compressed representations of a pattern
p[1..M] and a text t[1..N] of sizes m and n, respectively, does p occur
in t? We develop an optimal linear time solution for the case when p
and t are compressed using the LZW method. This improves the
previously known O((n + m) log(n + m)) time solution of Gąsieniec and
Rytter [10], and essentially closes the line of research devoted to
studying LZW-compressed exact pattern matching.
Key-words: pattern matching, compression, Lempel-Ziv

1 Introduction 

One of the most natural problems concerning processing information is pattern
matching, in which we are given a pattern p[1..M] and a text t[1..N], and have
to check if there is an occurrence of p in t. Although many very efficient (both
from a purely theoretical and a more practically oriented point of view)
solutions to this problem are known [15,7,12,8,4,3], most data is archived and
stored in a compressed form. This suggests an intriguing research direction: if
the text, or both the pattern and the text, are given in their compressed
representations, do we really need to decompress them in order to detect an
occurrence? If just the text is compressed, this is the compressed pattern
matching problem. For Lempel-Ziv-Welch compression, used for example in the
Unix compress utility and GIF images, Amir et al. introduced two algorithms
with respective time complexities $O(n + M^2)$ and O(n log M + M), where n is
the compressed size of the text. The pattern preprocessing time was then
improved [14] to get $O(n + M^{1+\epsilon})$ time complexity. In a recent paper
[11] we proved that in fact an O(n + M) solution is possible, as long as the
alphabet consists of integers which can be sorted in linear time. A more
general problem is the fully compressed pattern matching, where both the text
and the pattern are compressed. This problem seems to be substantially more
involved than compressed pattern matching, as we cannot afford to perform any
preprocessing for every possible prefix/suffix of the pattern, and such
preprocessing is a vital ingredient of any efficient pattern matching algorithm
known to the author. Nevertheless, Gąsieniec and Rytter [10] developed an
O((n + m) log(n + m)) time algorithm for this problem, where n and m are the
compressed sizes of the text and the pattern, respectively.

In this paper we show that in fact an optimal linear time solution is possible
for fully LZW-compressed pattern matching. The starting point of our algorithm
is the O(n + M) time algorithm [11]. Of course we cannot afford to use it
directly as M might be of order $m^2$. Nevertheless, we can apply the method to
a few carefully chosen fragments of the pattern. Then, using those fragments we
try to prune the set of possible occurrences, and verify them all at once by a
combined usage of the so-called PREF function and iterative compression of the
(compressed) pattern similar to the method of [10]. The chosen fragments
correspond to the most complicated part of the pattern, in a certain sense. If
there is no such part, we observe that the pattern is periodic, and a
modification of the algorithm from [11] can be applied. To state this
modification, and prove its properties, we briefly review the algorithm from
[11] in the next section. While the modification itself might seem simple, we
would like to point out that it is nontrivial, and we need quite a few
additional ideas in order to get the result.

2 Preliminaries 

We consider strings over a finite alphabet $\Sigma$ (which consists of integers
which can be sorted in linear time, namely $\Sigma = \{1, 2, \ldots, (n+m)^c\}$)
given in a Lempel-Ziv-Welch compressed form, where a string is represented as a
sequence of codewords. Each codeword is either a single letter or a previously
occurring codeword concatenated with a single character. This additional
character is not given explicitly: we define it as the first character of the
next codeword, and initialize the set of codewords to contain all single
characters in the very beginning. The resulting compression method enjoys a
particularly simple encoding/decoding process, but unfortunately requires
outputting at least $\Omega(\sqrt{N})$ codewords (so, for example, we are not
able to achieve an exponential compression possible in the general Lempel-Ziv
method). Still, its simplicity and good compression ratio achieved on real life
instances make it an interesting model to work with. For the rest of the paper
we will use LZW when referring to Lempel-Ziv-Welch compression.
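To make the codeword structure concrete, the following Python sketch shows a
textbook LZW encoder and decoder. It is only an illustration under our own
assumptions: the function names, the use of Python strings, and the explicit
`alphabet` argument are not part of the paper, and no attempt is made to pack
codes into bits.

```python
def lzw_compress(text, alphabet):
    """Return the LZW parse of `text` as a list of integer codes.

    Codes 0..len(alphabet)-1 denote single letters; every later code denotes
    an earlier codeword extended by one character.
    """
    table = {ch: i for i, ch in enumerate(alphabet)}
    codes = []
    current = ""
    for ch in text:
        if current + ch in table:
            current += ch                      # keep extending the codeword
        else:
            codes.append(table[current])       # emit the longest known codeword
            table[current + ch] = len(table)   # new codeword = old one + next letter
            current = ch
    if current:
        codes.append(table[current])
    return codes


def lzw_decompress(codes, alphabet):
    """Invert lzw_compress without being given the extra characters explicitly."""
    words = list(alphabet)
    if not codes:
        return ""
    previous = words[codes[0]]
    out = [previous]
    for code in codes[1:]:
        if code < len(words):
            entry = words[code]
        else:
            # The code was created by the step in progress, so it must be the
            # previous codeword followed by its own first letter.
            entry = previous + previous[0]
        # The implicit extra character is the first letter of the next codeword.
        words.append(previous + entry[0])
        out.append(entry)
        previous = entry
    return "".join(out)
```

In this sketch the k-th emitted codeword has length at most k, so a parse with
n codewords describes at most n(n + 1)/2 characters, which is where the
$\Omega(\sqrt{N})$ lower bound on the number of codewords comes from.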

We are interested in a variation of the classical pattern matching problem:
given a pattern p[1..M] and a text t[1..N], does p occur in t? We assume that
both p and t are given in LZW compressed forms of size m and n, respectively,
and wish to achieve a running time depending on the size of the compressed
representation n + m, not on the original lengths N and M. If the pattern does
occur in the text, we would like to get the position of its first occurrence.
We call this problem the fully compressed pattern matching.

A closely related problem is the compressed pattern matching, where we aim
to detect the first occurrence of an uncompressed pattern in a compressed text.
In a previous paper [11], we proved that this problem can be solved in
deterministic linear time. This O(n + M) time algorithm will be our starting
point. Of course we cannot directly apply it as M might be of order $m^2$.
Nevertheless, a modified version of this solution will be one of our basic
building blocks. In order to state the modification and prove its correctness
we briefly review the idea behind the original algorithm in the remaining part
of this section. First we need a few auxiliary lemmas. A period of a word w is
an integer $0 < d \le |w|$ such that w[i] = w[i + d] whenever both letters are
defined.

Lemma 1 (Periodicity lemma). If both d and d' are periods of w, and
$d + d' \le |w| + \gcd(d, d')$, then gcd(d, d') is a period as well.
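The definition of a period and the statement of Lemma 1 can be checked by brute
force on small words. The sketch below is quadratic and purely illustrative;
the function names are ours.

```python
from math import gcd


def periods(w):
    """All periods d with 0 < d <= len(w): w[i] == w[i + d] wherever both exist."""
    n = len(w)
    return [d for d in range(1, n + 1)
            if all(w[i] == w[i + d] for i in range(n - d))]


def satisfies_periodicity_lemma(w):
    """Check Lemma 1 on w: if d + e <= |w| + gcd(d, e), gcd(d, e) is a period."""
    ps = set(periods(w))
    return all(gcd(d, e) in ps
               for d in ps for e in ps
               if d + e <= len(w) + gcd(d, e))
```

For example, `periods("abcabcab")` lists 3, 6 and the trivial period 8, and the
lemma holds for every word, so `satisfies_periodicity_lemma` always returns
`True`.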

Using any linear time suffix array construction algorithm and fast LCA 
queries pQ we get the following. 

Lemma 2. Pattern p can be preprocessed in linear time so that given any two 
fragments p[i . . i + k] and p[j . . j + k] we can find their longest common prefix 
(suffix) in constant time. 

Lemma 3. Pattern p can be preprocessed in linear time so that given any
fragment p[i..j] we can find its longest prefix which is a suffix of the whole
pattern in constant time, assuming we know the (explicit or implicit) vertex
corresponding to p[i..j] in the suffix tree.

A border of a word w is an integer $0 \le b < |w|$ such that
w[1..b] = w[|w| - b + 1..|w|]. By applying the preprocessing from the
Knuth-Morris-Pratt algorithm for both the pattern and the reversed pattern we
get the following.

Lemma 4. Pattern p can be preprocessed in linear time so that we can find the
border of each of its prefixes (suffixes) in constant time.
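The preprocessing behind Lemma 4 is the classical failure function. A minimal
Python sketch (0-based strings, our own naming) that computes the longest
border of every prefix in linear time:

```python
def border_table(p):
    """border[k] is the length of the longest proper border of p[:k]."""
    border = [0] * (len(p) + 1)
    k = 0
    for i in range(1, len(p)):
        # Fall back through shorter borders until one can be extended by p[i].
        while k > 0 and p[i] != p[k]:
            k = border[k]
        if p[i] == p[k]:
            k += 1
        border[i + 1] = k
    return border
```

Borders of suffixes are obtained by running the same routine on the reversed
pattern, and the smallest period of a prefix p[:k] is k - border[k].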

A snippet is any substring p[i..j] of the pattern, represented as a pair of
integers (i, j). If i = 1 we call it a prefix snippet, and if j = |p| a suffix
snippet. An extended snippet is a snippet for which we also store the
corresponding vertex in the suffix tree (built for the pattern) and the longest
suffix which is a prefix of the pattern. A sequence of snippets is a
concatenation of a number of substrings of the pattern.

A high-level idea behind the linear time LZW-compressed pattern matching
is to first reduce the problem to pattern matching in a sequence of extended
snippets. It turns out that if the alphabet is of constant size, the reduction
can be almost trivially performed in linear time, and for polynomial size
integer alphabets we can apply more sophisticated tools to get the same
complexity. Then we focus on pattern matching in a sequence of snippets. The
idea is to simulate the well-known Knuth-Morris-Pratt algorithm while operating
on whole snippets instead of single characters using Lemma 5 and Lemma 6.

Lemma 5. Given a prefix snippet and a suffix snippet we can detect an
occurrence of the pattern in their concatenation in constant time.

Lemma 6. Given a prefix snippet p[1..i] and a snippet p[j..k] we can find the
longest long border b of p[1..i] such that p[1..b]p[j..k] is a prefix of the
whole p in constant time, where a long border is $b \ge \frac{i}{2}$ such that
p[1..b] = p[i - b + 1..i].

During the simulation we might create some new snippets, but they will
always be either prefix snippets or half snippets of the form
$p[\frac{i}{2}..i]$. All information required to make those snippets extended
can be precomputed in a relatively straightforward way using O(m) time.

The running time of the resulting procedure is as much as O(n log m), though.
To accelerate it we try to detect situations when there is a long snippet near
the beginning of the sequence, and apply Lemma 7 and Lemma 8 to quickly process
all snippets on its left.

Lemma 7. Given a sequence of extended snippets $s_1 s_2 \ldots s_i$ such that
$|s_i| \ge 2\sum_{j<i} |s_j|$, we can detect an occurrence of p in
$s_1 s_2 \ldots s_i$ in time O(i).

Lemma 8. Given a sequence of extended snippets $s_1 s_2 \ldots s_i$ such that
$|s_i| \ge 2\sum_{j<i} |s_j|$, we can compute the longest prefix of p which is
a suffix of $s_1 s_2 \ldots s_i$ in time O(i).

After such modification the algorithm works in linear time, which can be 
shown by defining a potential function depending just on the lengths of the 
snippets, see the original paper. 

3 Overview of the algorithm 

Our goal is to detect an occurrence of a pattern p[1..M] in a given text
t[1..N], where p and t are described by a Lempel-Ziv-Welch parse of size m and
n, respectively. The difficulty here is rather clear: M might be of order
$m^2$, and hence looking at each possible prefix or suffix of the pattern would
imply a quadratic (or higher) total complexity. As most efficient uncompressed
pattern matching algorithms are based on a more or less involved preprocessing
concerning all prefixes or suffixes, such quadratic behavior seems difficult to
avoid. Nevertheless, we can try to use the following reasoning here: either the
pattern is really complicated, and then m is very similar to M, hence we can
use the linear compressed pattern matching algorithm sketched in the previous
section, or it is in some sense repetitive, and we can hope to speed up the
preprocessing by building on this repetitiveness. In this section we give a
high level formalization of this intuition.

We will try to process whole codewords at once. To this aim we need the
following technical lemma which allows us to compare large chunks of the text
(or the pattern) in a single step. It follows from the linear time construction
of the so-called suffix tree of a tree [16] and constant time LCA queries pQ .

Lemma 9. It is possible to preprocess in linear time an LZW parse of a text
over an alphabet consisting of integers which can be sorted in linear time so
that given any two codewords we can compute their longest common suffix in
constant time.

We defer its proof to Section 6 as it is not really necessary to understand the
whole idea. As an obvious corollary, given two codewords we can check if the
shorter is a suffix of the longer in constant time.

In the very beginning we reverse both the pattern and the text. This is 
necessary because the above lemma tells how to compute the longest common 
suffix, and we would actually like to compute the longest common prefix. The 
only way we will access the characters of both the pattern and the text is either 
through computing the longest common prefix of two reversed codewords, or 
retrieving a specified character of a reversed codeword (which can be performed 
in constant time using level ancestor queries), hence the input can be safely 
reversed without worrying that it will make working with it more complicated. 
We call those reversed codewords blocks. Note that all suffixes of a block are 
valid blocks as well. 

We start with classifying all possible patterns into two types. Note that this
classification depends on both m (size of the compressed pattern) and n (size
of the compressed text), which might seem a little unintuitive.

Definition 1. A kernel of the pattern is any (uncompressed) substring of length
n + m such that its border is at most $\frac{n+m}{2}$. A kernel decomposition
of the pattern is its periodic prefix with period at most $\frac{n+m}{2}$
followed by a kernel.

Note that the distance between two occurrences of such a substring must be at
least $\frac{n+m}{2}$, and hence a kernel occurs at most 2m times in the
pattern and 2n times in the text. It might happen that there is no kernel, or
in other words all relatively short fragments are highly repetitive. In such a
case the whole pattern turns out to be highly repetitive.

Lemma 10. The pattern either has a kernel decomposition or its period is at
most $\frac{n+m}{2}$. Moreover, those two situations can be distinguished in
linear time, and if a decomposition exists it can be found in the same
complexity.

Proof. We start with decompressing the prefix of length n + m. If its period
d is at least $\frac{n+m}{2}$, we can return it as a kernel. Otherwise we
compute the longest common prefix of the pattern and the pattern shifted by d
characters (or, in other words, we compute how far the period extends in the
whole pattern). This can be performed using at most 2(n + m) queries described
in Lemma 9 (note that being able to check if one block is a prefix of another
would be enough to get a linear total complexity here, as we can first identify
the longest prefix consisting of whole blocks in both words and then inspect
the remaining at most min(n, m) characters naively). If d is the period of the
whole pattern, we are done. Otherwise we identified a substring s of length
n + m followed by a character a such that the period of s is at most
$d < \frac{n+m}{2}$ but the period of sa is different (larger). We remove the
first character of s and get s'. Let d' be the period of s'a. If
$d' \ge \frac{n+m}{2}$, s'a is a kernel. Otherwise
$d, d' \le \frac{n+m-1}{2}$ are both periods of s', and hence by Lemma 1 they
are both multiples of the period of s'. Let b be the character such that d is
a period of s'b (note that $a \ne b$). Because d is a period of s'b,
s'[|s'| + 1 - d] = b. Similarly, because d' is a period of s'a,
s'[|s'| + 1 - d'] = a. Hence $s'[|s'| + 1 - d] \ne s'[|s'| + 1 - d']$, and
because (|s'| + 1 - d) - (|s'| + 1 - d') is a multiple of the period of s' we
get a contradiction. Note that the prefix before s'a is periodic with period
$d < \frac{n+m}{2}$ by the construction.
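On an uncompressed pattern the notion of a kernel can be explored with a naive
scan. The sketch below is not the linear time procedure of the proof (it is
quadratic and works on a fully decompressed string); the parameter `length`
plays the role of n + m and all names are ours.

```python
def longest_border(w):
    """Length of the longest proper border of w (KMP failure function)."""
    border = [0] * (len(w) + 1)
    k = 0
    for i in range(1, len(w)):
        while k > 0 and w[i] != w[k]:
            k = border[k]
        if w[i] == w[k]:
            k += 1
        border[i + 1] = k
    return border[len(w)]


def find_kernel(p, length):
    """0-based start of the first window of `length` characters whose border is
    at most length / 2, or None if every such window is highly periodic."""
    for start in range(len(p) - length + 1):
        if 2 * longest_border(p[start:start + length]) <= length:
            return start
    return None
```

For instance, with `length = 6` the pattern `"ababababc"` has its first kernel
`"bababc"` at index 3, preceded by the periodic prefix `"aba"`, while
`"abababab"` has no kernel at all and is periodic with period 2.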

If the pattern turns out to be repetitive, we try to apply the algorithm
described in the preliminaries. The intuition is that while we required a
certain preprocessing of the whole pattern, when its period is d it is enough
to preprocess just its prefix of length O(d). This intuition is formalized in
Section 4. If the pattern has a kernel, we use it to identify O(n) potential
occurrences, which we then manage to verify efficiently. The verification uses
a similar idea to the one from Lemma 9, but unfortunately it turns out that we
need to somehow compress the pattern during the verification so as to keep the
running time linear. The details of this part are given in Section 5.

4 Detecting occurrence of a periodic pattern 

If the pattern is periodic, we would like to somehow use this periodicity so
that we do not have to preprocess the whole pattern (i.e., build the suffix
tree, LCA structure, compute the borders of all prefixes and suffixes, and so
on). It seems reasonable that preprocessing just the first few repetitions of
the period should be enough. More precisely, we will decompress a sufficiently
long prefix of p and compute some of its occurrences inside the text. To
compute those occurrences we apply a fairly simple modification of
Levered-pattern-matching called Lazy-levered-pattern-matching.

First observe that both Lemma 5 and Lemma 7 can be modified in a
straightforward way so that we get the leftmost occurrence, if any. The
original procedure quits as soon as it detects that the pattern occurs. We
would like it to proceed so that we get more than one occurrence, though. A
naive solution would be to simply continue, but then the following situation
could happen: both $\ell$ and $|s_k|$ are very close to m, the pattern occurs
both in the very beginning of $p[1..\ell]s_k$ and somewhere close to the
boundary between the two parts, and the longest suffix of the concatenation
which is a prefix of the pattern is very short. Then we would detect just the
first occurrence, and for reasons that will be clear in the proof of Lemma 14
this is not enough. Hence whenever there is an occurrence in the concatenation,
we skip just the first half of $p[1..\ell]$ and continue, see lines 11-14. This
is the only change in the algorithm.

While Lazy-levered-pattern-matching is not capable of generating all
occurrences in some cases, it will always detect a lot of them, in a certain
sense. This is formalized in the following lemma.

Lemma 11. If the pattern of length m occurs starting at the i-th character,
Lazy-levered-pattern-matching detects at least one occurrence starting at
the j-th character for some j with $0 \le i - j \le \frac{m}{2}$.

Algorithm 1 Lazy-levered-pattern-matching($s_1, s_2, \ldots, s_n$)
 1: $\ell \leftarrow$ longest prefix of p ending $s_1$    ▷ Lemma 3
 2: $k \leftarrow 2$
 3: while $k \le n$ and $\ell + \sum_{i=k}^{n} |s_i| \ge m$ do
 4:     choose $t \ge k$ minimizing $|s_k| + |s_{k+1}| + \ldots + |s_{t-1}| - \frac{|s_t|}{2}$
 5:     if $\ell + |s_k| + |s_{k+1}| + \ldots + |s_{t-1}| \le \frac{|s_t|}{2}$ then
 6:         output the first occurrence of p in $p[1..\ell] s_k s_{k+1} \ldots s_t$, if any    ▷ Lemma 7
 7:         $\ell \leftarrow$ longest prefix of p ending $p[1..\ell] s_k s_{k+1} \ldots s_t$    ▷ Lemma 8
 8:         $k \leftarrow t + 1$
 9:     else
10:         output the first occurrence of p in $p[1..\ell] s_k$, if any    ▷ Lemma 5
11:         if p occurs in $p[1..\ell] s_k$ then
12:             $\ell \leftarrow$ longest prefix of p ending $p[\lceil \frac{\ell}{2} \rceil..\ell]$
13:             continue
14:         end if
15:         if $p[1..\ell] s_k$ is a prefix of p then
16:             $\ell \leftarrow \ell + |s_k|$
17:             $k \leftarrow k + 1$
18:             continue
19:         end if
20:         $b \leftarrow$ longest long border of $p[1..\ell]$ s.t. $p[1..b] s_k$ is a prefix of p    ▷ Lemma 6
21:         if b is undefined then
22:             $\ell \leftarrow$ longest prefix of p ending $p[\lceil \frac{\ell}{2} \rceil..\ell]$
23:             continue
24:         end if
25:         $\ell \leftarrow b + |s_k|$
26:         $k \leftarrow k + 1$
27:     end if
28: end while

Proof. There are just two places where we can lose a potential occurrence:
line 7 and line 12. More precisely, it is possible that we output an occurrence
and then skip a few others. We would like to prove that the occurrences we skip
are quite close to the occurrences we output. We consider the two problematic
lines separately.

line 7: $s_t$ is a lever, so $\ell + |s_k| + \ldots + |s_t| \le \frac{3}{2}m$.
    Hence the distance between any two occurrences of the pattern inside
    $p[1..\ell] s_k s_{k+1} \ldots s_t$ is at most $\frac{m}{2}$. We output the
    first of them, and so can safely ignore the remaining ones.

line 12: If there is an occurrence, we remove the first half of $p[1..\ell]$
    and might skip some other occurrences starting there. If the first
    occurrence starts later, we will not skip anything. Otherwise we output the
    first occurrence starting in $p[1..\frac{\ell}{2}]$, and if there is any
    other occurrence starting there, their distance is at most
    $\frac{\ell}{2} \le \frac{m}{2}$, hence we can safely ignore the latter.

Lemma 12. Lazy-levered-pattern-matching can be implemented to work
in time O(n) and use O(m) additional memory.

Proof. The proof is almost the same as in [11]. The only difference as far as
the running time is concerned is line 12. By removing the first half of
$p[1..\ell]$ we either decrease the current potential by 1 or create a lever
and thus can amortize the constant time used to locate the first occurrence of
the pattern inside $p[1..\ell] s_k$.

Note that Lazy-levered-pattern-matching works with a sequence of
snippets. By first applying the preprocessing mentioned in the preliminaries we
can use it to compute a small set which approximates all occurrences in a
compressed text.

Lemma 13. Lazy-levered-pattern-matching can be used to compute a set
S of O(n) occurrences of an uncompressed pattern of length $m \ge n$ in a
compressed text such that whenever there is an occurrence starting at the i-th
character, S contains j from $\{i - \frac{m}{2} + 1, \ldots, i\}$.

Proof. As mentioned in the preliminaries, we can reduce compressed pattern
matching to pattern matching in a sequence of snippets in linear time. Because
$m \ge n$, the preprocessing does not produce any occurrences yet. Then we
apply Lazy-levered-pattern-matching. Because its running time is linear by
Lemma 12, it cannot find more than O(n + m) occurrences. A closer look at the
analysis shows that the number of occurrences produced can be bounded by the
potentials of all sequences created during the initial preprocessing phase,
which as shown in [11] is at most O(n).

Now it turns out that if the pattern is compressed but highly periodic, the 
occurrences found in linear time by the above lemma applied to a sufficiently 
long prefix of p are enough to detect an occurrence of the whole pattern. 

Lemma 14. Fully compressed pattern matching can be solved in linear time if
the pattern is of compressed size $m \ge n$ and its period is at most
$\frac{n+m}{2}$. Furthermore, given a set of r potential occurrences we can
verify all of them in O(n + m + r) time.

Proof. We build the shortest prefix $p[1..\alpha d]$ such that
$\alpha d \ge n + m$, where $d \le \frac{n+m}{2}$ is the period of the whole
pattern. Observe that $\alpha d \le \frac{3}{2}(n+m)$ and hence we can afford
to store this prefix in an uncompressed form. By Lemma 13 we construct a set S
of O(n) occurrences of $p[1..\alpha d]$ such that for any other occurrence
starting at the i-th character there exists $j \in S$ such that
$0 \le i - j < \frac{\alpha d}{2} \le \frac{3}{4}(n+m)$. We partition the
elements in S according to their remainders modulo d so that
$S_r = \{j \in S : j \equiv r \pmod d\}$ and consider each $S_r$ separately.
Note that we can easily ensure that its elements are sorted by either applying
radix sort to the whole S or observing that Lazy-levered-pattern-matching
generates the occurrences from left to right.

We split $S_r$ into maximal groups of consecutive elements
$x_1 < x_2 < \ldots < x_k$ such that $x_{i+1} \le x_i + \frac{\alpha d}{2}$,
which clearly can be performed in linear time with a single left-to-right
sweep. Each such group actually corresponds to a fragment starting at the
$x_1$-th character and ending at the $(x_k + \alpha d - 1)$-th character which
is a power of p[1..d]. This is almost enough to detect an occurrence of the
whole pattern. If the fragment is sufficiently long, we get an occurrence. In
some cases this is not enough to detect the occurrence because we might be
required to extend the period to the right so as to make sufficient space for
the whole pattern. Fortunately, it is impossible to repeat p[1..d] more than
$\frac{3}{2}\alpha$ times starting at the $x_k$-th character, as otherwise we
would have another $x_{k+1} \in S_r$ which we might have used to extend the
group. Hence to compute how far the period extends it would be enough to align
$p[1..\alpha d]\,p[1..\frac{\alpha d}{2}]$ starting at the $x_k$-th character
and compute the first mismatch with the text. We can assume that all suffixes
of $p[1..\alpha d]$ are blocks with just a linear increase in the problem size,
and hence we can apply Lemma 9 to preprocess the input so that each such
alignment can be processed in time proportional to the number of blocks in the
corresponding fragment of the text. To finish the proof, note that any single
block in the text will be processed at most twice. Otherwise we would have two
groups ending at the $x_k$-th and $x'_{k'}$-th characters such that
$|x_k + \alpha d - (x'_{k'} + \alpha d)| < \frac{\alpha d}{2}$ and that would
mean that one of those groups is not maximal. After computing how far the
period extends after each group, we only have to check a simple arithmetic
condition to find out if the pattern occurs starting at the corresponding
$x_1$.

To verify a set of r potential occurrences, we construct the groups and
compute how far the period extends after each of them as above. Then for each
potential occurrence starting at the $b_i$-th character we look up the
corresponding $S_{b_i \bmod d}$ and find the rightmost group such that
$x_1 \le b_i$. We can verify an occurrence by looking up how far the period
extends after the $x_k$-th character and checking a simple arithmetic
condition. To get the claimed time bound, observe that we do not have to
perform the lookup separately for each possible occurrence. By first splitting
them according to their remainders modulo d and sorting all $x_1$ and $b_i$ in
linear time using radix sort consisting of two passes we get a linear total
complexity.
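The bookkeeping part of this proof (splitting S by remainders modulo d and
forming maximal groups) is easy to sketch in Python. The code below is only an
illustration: `gap` is a hypothetical parameter standing for the distance
threshold used to form groups, and a comparison sort replaces the radix sort
for brevity.

```python
from collections import defaultdict


def group_occurrences(S, d, gap):
    """Partition positions by remainder mod d, then cut every class into
    maximal runs in which consecutive positions differ by at most `gap`."""
    classes = defaultdict(list)
    for j in sorted(S):              # the proof uses radix sort instead
        classes[j % d].append(j)
    groups = {}
    for r, xs in classes.items():
        runs = [[xs[0]]]
        for x in xs[1:]:
            if x - runs[-1][-1] <= gap:
                runs[-1].append(x)   # extend the current group
            else:
                runs.append([x])     # start a new maximal group
        groups[r] = runs
    return groups
```

Each run stands for one periodic fragment of the text; extending the period to
the right of the last element of a run is the part that needs the block
comparisons of Lemma 9 and is not modelled here.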

5 Using kernel to accelerate pattern matching 

We start with computing all occurrences of the kernel in both the pattern and
the text. Because the kernel is long and aperiodic, there are no more than 2m
of the former and 2n of the latter. The question is if we are able to detect
all those occurrences efficiently. It turns out that because the kernel is
aperiodic, Lazy-levered-pattern-matching can be (again) used for the task. More
formally, we have the following lemma.

Lemma 15. Lazy-levered-pattern-matching can be used to compute in
O(n + m) time all occurrences of an aperiodic pattern of length $m \ge n$ in a
compressed text.

Proof. By Lemma 13 we can construct in linear time a set of occurrences such
that any other occurrence is quite close to one of them. But the pattern is
aperiodic, so if it occurs at positions i and j with $|i - j| < \frac{m}{2}$
then in fact i = j. Hence the set contains all occurrences.

We apply the above lemma to find the occurrences of the kernel in both the
pattern and the text. Each occurrence of the kernel in the text gives us a
possible candidate for an occurrence of the whole pattern (for example by
aligning it with the first occurrence of the kernel in the pattern). Hence we
have just a linear number of candidates to verify. Still, the verification is
not trivial. An obvious approach would be to repeat a computation similar to
the one from Lemma 10 for each candidate. This would be too slow, though, as it
might turn out that some blocks from the pattern are inspected multiple times.
We require a slightly more sophisticated approach.

Using (any) kernel decomposition of the pattern we represent it as
$p = p_1 p_2 p_3$, where the period of $p_1$ is at most $\frac{n+m}{2}$, and
$p_2$ is a kernel. We start with locating all occurrences of $p_2 p_3$ in the
text. It turns out that because $p_2$ is aperiodic, there cannot be too many of
them. Hence we can afford to generate all such occurrences and then verify if
any of them is preceded by $p_1$ as follows:

1. if $|p_1| \ge n$ then we can directly apply Lemma 14,
2. if $|p_1| < n$ then take the prefix of $p_1 p_2$ consisting of the first
   n + m letters. Depending on whether this prefix is periodic with the period
   at most $\frac{n+m}{2}$ or aperiodic, we can apply Lemma 14 or Lemma 15.

The most involved part is computing all occurrences of $p_2 p_3$. To find them
we construct a new string T by concatenating the suffix of the pattern and the
text:

$T = p_2 p_3 \$ t[1..N]$

For this new string we compute the values of the prefix function defined in the
following way:

$PREF[i] = \max\{j : T[k] = T[i + k - 1] \text{ for all } k = 1, 2, \ldots, j\}$

Of course we cannot afford to compute PREF[i] for all possible N + M values of
i. Fortunately, $PREF[i] \ge |p_2|$ iff $p_2$ occurs in T starting at the i-th
character. Because $|p_2| = n + m$ and $p_2$ is aperiodic, there are no more
than $\frac{2|T|}{n+m} \le n + m$ such values of i. We aim to compute PREF[i]
just for those i. First let us take a look at the relatively well-known
algorithm which computes PREF[i] for all i, which can be found in the classic
stringology book by Crochemore and Rytter [6]. We state its code for the sake
of completeness. Naive-Scan(x, y) performs a naive scanning of the input
starting at the x-th and y-th characters. PREF uses this procedure in a clever
way so as to reuse already processed parts of the input and keep the total
running time linear. The complexity is linear because the value of s + PREF[s]
can neither decrease nor exceed |T|, and whenever it increases we are able to
pay for the time spent in Naive-Scan using the difference between the new and
the old value.

We will transform this algorithm so that it computes only PREF[i] such
that the kernel occurs starting at the i-th character. We call such positions i
interesting. The first problem we encounter is that we need constant time
access to any PREF[i] and cannot afford to allocate a table of length |T|. This
can be easily overcome.

Algorithm 2 PREF(T[1..|T|])
 1: $PREF[1] = 0$, $s \leftarrow 1$
 2: for $i = 2, 3, \ldots, |T|$ do
 3:     $k \leftarrow i - s + 1$
 4:     $r \leftarrow s + PREF[s] - 1$
 5:     if $r < i$ then
 6:         $PREF[i] =$ Naive-Scan$(i, 1)$
 7:         if $PREF[i] > 0$ then
 8:             $s \leftarrow i$
 9:         end if
10:     else if $PREF[k] + k \le PREF[s]$ then
11:         $PREF[i] \leftarrow PREF[k]$
12:     else
13:         $x \leftarrow$ Naive-Scan$(r + 1, r - i + 2)$
14:         $PREF[i] \leftarrow r - i + 1 + x$
15:         $s \leftarrow i$
16:     end if
17: end for
18: $PREF[1] = |T|$
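A direct transcription of Algorithm 2 into Python may help when following the
modifications described next. This is a sketch on an uncompressed Python string
with 1-based indices emulated by an unused slot 0; the helper `occurrences` and
the separator argument are our own additions for illustration. The test in
line 10 is written with a non-strict inequality, since only then is the copied
value guaranteed to stop strictly inside the window T[s..r].

```python
def naive_scan(T, x, y):
    """Number of positions on which T[x..] and T[y..] agree (1-based x, y)."""
    n = len(T)
    j = 0
    while x + j <= n and y + j <= n and T[x + j - 1] == T[y + j - 1]:
        j += 1
    return j


def pref(T):
    """PREF[i] = length of the longest common prefix of T and T[i..], 1-based."""
    n = len(T)
    PREF = [0] * (n + 1)          # PREF[0] is unused
    s = 1
    for i in range(2, n + 1):
        k = i - s + 1
        r = s + PREF[s] - 1       # T[s..r] equals T[1..PREF[s]]
        if r < i:
            PREF[i] = naive_scan(T, i, 1)
            if PREF[i] > 0:
                s = i
        elif PREF[k] + k <= PREF[s]:
            PREF[i] = PREF[k]     # the match ends strictly inside the window
        else:
            x = naive_scan(T, r + 1, r - i + 2)
            PREF[i] = r - i + 1 + x
            s = i
    if n > 0:
        PREF[1] = n
    return PREF


def occurrences(p, t, sep="$"):
    """1-based starting positions of p in t, assuming sep occurs in neither."""
    T = p + sep + t
    P = pref(T)
    return [i - len(p) - 1
            for i in range(len(p) + 2, len(T) + 1) if P[i] >= len(p)]
```

For example, `occurrences("ab", "cabab")` reports the positions 2 and 4.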



Lemma 16. A random access table PREF such that PREF[i] > 0 iff the kernel
occurs starting at the i-th character can be implemented in constant time per
operation requiring space and preprocessing time not exceeding the compressed
size of T, which is O(n + m).

Proof. Observe that any two occurrences of the kernel cannot be too close. More
precisely, their distance must be at least $\frac{n+m}{2}$. We split the whole
T into disjoint fragments of size $\frac{n+m}{2}$. There are no more than
2(n + m) of them and there is at most one occurrence in each of them. Hence we
can implement the table by allocating an array of size 2(n + m) with each entry
storing at most one element.
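The bucketing argument of this proof translates into a very small data
structure. The following Python sketch is ours (class and method names are
hypothetical); it assumes that the positions ever stored are pairwise at least
`min_distance` apart, which is exactly what the aperiodicity of the kernel
provides.

```python
class SparsePref:
    """Table indexed by positions 1..total_length that stores values only at
    positions which are pairwise at least `min_distance` apart."""

    def __init__(self, total_length, min_distance):
        self.width = min_distance
        # One slot per fragment of `min_distance` consecutive positions.
        self.slots = [None] * (total_length // min_distance + 1)

    def set(self, i, value):
        self.slots[i // self.width] = (i, value)

    def get(self, i):
        entry = self.slots[i // self.width]
        if entry is not None and entry[0] == i:
            return entry[1]
        return 0                  # position i is not interesting
```

Two distinct positions that fall into the same slot differ by less than
`min_distance`, so under the stated assumption a slot never has to hold more
than one pair.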

We modify line 2 so that it iterates only through i which are interesting. Note
that whenever we access some PREF[j] inside, j is either i, s or k = i - s + 1.
In the first two cases it is clear that the corresponding positions are
interesting so we can access the corresponding value using Lemma 16. The third
case is not that obvious, though. It might happen that k is not interesting and
we will get PREF[k] = 0 instead of the true value. If $r \ge i + |p_2| - 1$
then because $p_2$ occurs at i, it occurs at k as well, and so k is
interesting. Otherwise we cannot access the true value of PREF[k], so we start
a naive scan by calling Naive-Scan($i + |p_2|$, $|p_2| + 1$) (we can start at
the $(|p_2| + 1)$-th character because $p_2$ occurs at i). After the scanning
we set $s \leftarrow i$. Note that because $r < i + |p_2| - 1$, this increases
the current value of s + PREF[s], and we can use the increase to amortize the
scanning.

We still have to show how to modify Naive-Scan. Clearly we cannot afford to
perform the comparisons character by character. By the increasing s + PREF[s]
argument, any single character from the text is inspected at most once by
accessing T[x] (we call it a left side access). It might be inspected multiple
times by accessing T[y], though (which we call a right side access). We would
like to perform the comparisons block by block using Lemma 9. After a single
query we skip at least one block. If we skip a block responsible for the left
side access, we can clearly afford to pay for the comparison. We need to
somehow amortize the situation when we skip a block responsible for the right
side access. For this we will iteratively compress the input (this is similar
to the idea used in [5] with the exception that we work with PREF instead of
the failure function). More formally, consider the sequence of blocks
describing $p_2 p_3$. First note that no further blocks from T will be
responsible for a right side access because of the unique $ character. Whenever
some two neighboring blocks $b_1, b_2$ from this prefix occur in the same block
b' from the text, we would like to glue them, i.e., replace them by a single
block. We cannot be sure that there exists a block corresponding to their
concatenation, but because we know where it occurs in b' we can extract (in
constant time, by using the level ancestor data structure [2] to preprocess the
whole code trie) a block for which the concatenation is a prefix. We will
perform such a replacement whenever possible. Unfortunately, after such a
replacement we face a new problem: $p_2 p_3$ is represented as a concatenation
of prefixes of blocks instead of whole blocks. Nevertheless, we can still apply
Lemma 9 to compute the longest common prefix of two prefixes of blocks b[1..i]
and b'[1..i'] by first computing the longest common prefix of b and b' and
decreasing it if it exceeds min(i, i'). More formally, we store a block cover
of $p_2 p_3$.

Definition 2. A block cover of a word w is a sequence b1[1..i1], ..., bk[1..ik]
of prefixes of blocks such that their concatenation is equal to w.
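
A small sketch may help to fix the notation. It assumes that blocks are given as plain strings and that a cover is a list of (block, prefix length) pairs. The helper common_prefix_length is a naive stand-in for the constant-time query of Lemma 9, and all function names are ours.

```python
def common_prefix_length(x, y):
    """Naive stand-in for the constant-time block query."""
    n = 0
    while n < len(x) and n < len(y) and x[n] == y[n]:
        n += 1
    return n


def expand_cover(cover):
    """Concatenate the prefixes b[1..i] listed in the cover."""
    return "".join(block[:i] for block, i in cover)


def is_block_cover(cover, w):
    """Check Definition 2: valid prefix lengths and concatenation equal to w."""
    lengths_ok = all(1 <= i <= len(block) for block, i in cover)
    return lengths_ok and expand_cover(cover) == w


def lcp_of_prefixes(b, i, b2, i2):
    """Longest common prefix of b[1..i] and b2[1..i2]: query whole blocks, then clamp."""
    return min(common_prefix_length(b, b2), i, i2)
```

For instance, the cover [("abab", 2), ("bac", 3), ("aab", 1)] spells the word abbaca, and the clamping in lcp_of_prefixes is exactly the decrease to min(i, i') described before the definition.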



Fig. 1. A block cover of w.

Fig. 2. Compressing the current block cover.



This definition is illustrated in Figure 1. Obviously, the initial partition of
p2p3 into blocks is a valid block cover. If during the execution of Naive-Scan
we find out that two neighboring elements bj[1..ij], bj+1[1..ij+1] of the current
cover occur in some other longer block b, we replace them with the corresponding
prefix of b', see Figure 2.
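
The replacement step can be sketched on plain strings as follows. This is a simplified model under our own assumptions: the occurrence is found by a naive search (in the text its position is already known), and slicing the longer block at that position stands in for the level ancestor query. The names glue and expand are hypothetical.

```python
def expand(cover):
    """Word spelled by a cover given as (block, prefix_length) pairs."""
    return "".join(block[:i] for block, i in cover)


def glue(cover, j, longer_block):
    """Try to replace cover[j] and cover[j + 1] by one prefix of a block.

    Returns True if the two neighboring elements were replaced.
    """
    (b1, i1), (b2, i2) = cover[j], cover[j + 1]
    joined = b1[:i1] + b2[:i2]
    offset = longer_block.find(joined)  # naive search, for illustration only
    if offset < 0:
        return False
    # Stand-in for the extracted block: it has `joined` as a prefix.
    extracted = longer_block[offset:]
    cover[j:j + 2] = [(extracted, i1 + i2)]
    return True
```

The word spelled by the cover does not change; only the number of elements decreases by one.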

We store all bj[1..ij] on a doubly linked list and update its elements
accordingly after each replacement. The final required step is to show how we can
quickly access in line 13 the block corresponding to r - i + 2. We clearly are
allowed to spend just constant time there. We keep a pointer to the block
covering the r-th character of the pattern. Whenever we need to access the block
covering the (r - i + 2)-th character, we simply move the pointer to the left, and
whenever the current longest match extends, we move the pointer to the right.
We cannot move to the left more times than we move to the right, and the latter
can be bounded by the number of blocks in the whole p1p2 if we replace the
neighboring blocks whenever it is possible.
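
A minimal sketch of the list operations, assuming blocks are plain strings; the names Node, build_list, replace_pair and to_word are ours. Moving the pointer left or right corresponds to following the prev and next fields.

```python
class Node:
    """One element b[1..length] of the current block cover."""

    def __init__(self, block, length):
        self.block, self.length = block, length
        self.prev = self.next = None


def build_list(cover):
    """Build a doubly linked list from (block, length) pairs; return its head."""
    head = prev = None
    for block, length in cover:
        node = Node(block, length)
        node.prev = prev
        if prev is None:
            head = node
        else:
            prev.next = node
        prev = node
    return head


def replace_pair(head, left, block, length):
    """Replace `left` and `left.next` by one new node in constant time."""
    right = left.next
    new = Node(block, length)
    new.prev, new.next = left.prev, right.next
    if left.prev is None:
        head = new
    else:
        left.prev.next = new
    if right.next is not None:
        right.next.prev = new
    return head, new


def to_word(head):
    """Spell the word represented by the list (linear time, for checking only)."""
    parts = []
    node = head
    while node is not None:
        parts.append(node.block[:node.length])
        node = node.next
    return "".join(parts)
```

A pointer that referred to one of the two replaced nodes has to be redirected to the new node; this is a detail of our sketch rather than something stated in the text.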

Lemma 17. Fully compressed pattern matching can be solved in linear time if 
we are given the kernel of the pattern. 

Theorem 1. Fully LZW-compressed pattern matching for strings over a
polynomial size integer alphabet can be solved in optimal linear time assuming the
word RAM model.

6 LZW parse preprocessing 

The goal of this section is to prove Lemma 9. We aim to preprocess the codeword
trie so that given any two codewords, we can compute their longest common
suffix in constant time. It turns out that we can use some existing (though perhaps
not widely known) tools to achieve that. The suffix tree of a tree A, where A is
a tree with edges labeled with single characters, is defined as the compressed
trie containing s_A(v)$ for all v ∈ A, where s_A(v) is the string constructed by
concatenating the labels of all edges on the v-to-root path in A, see Figure 3.
This was first used by Kosaraju [13], who developed an O(|A| log |A|) time
construction algorithm, where |A| is the number of nodes of A. The complexity
was then improved by Breslauer [5] to just O(|A| log |Σ|) (which for constant
alphabets is linear), and by Shibuya [16] to linear for integer alphabets.
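
To make the definition concrete, the strings s_A(v)$ can be listed naively by walking from each node to the root. The sketch below assumes the tree is given by parent and edge-label arrays with node 0 as the root; the particular nine-node trie is our reconstruction, chosen to be consistent with the strings listed with Figure 3. It only enumerates the strings and does not build the compressed trie.

```python
def path_to_root_strings(parent, label):
    """Return s_A(v) + "$" for every node v; node 0 is the root."""
    result = []
    for v in range(len(parent)):
        chars = []
        u = v
        while parent[u] is not None:
            chars.append(label[u])  # label of the edge between parent[u] and u
            u = parent[u]
        result.append("".join(chars) + "$")
    return result


# Nodes spell (root to node): "", a, b, aa, ab, bb, aab, bba, bbaa.
PARENT = [None, 0, 0, 1, 1, 2, 3, 5, 7]
LABEL = [None, "a", "b", "a", "b", "b", "b", "a", "a"]
STRINGS = path_to_root_strings(PARENT, LABEL)
```

Here STRINGS contains $, a$, b$, aa$, ba$, bb$, baa$, abb$ and aabb$. This naive enumeration may take quadratic time in the worst case, which is why the dedicated constructions cited above are needed.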

Fig. 3. A trie (on the left) and its suffix tree (on the right). The suffix tree
is built for [$, a$, b$, aa$, ba$, bb$, baa$, abb$, aabb$].






We build the suffix tree of the codeword trie T in linear time [16]. As a
result we also get, for any node v of the input trie, the node of the suffix tree
corresponding to s_T(v)$. Now assume that we would like to compute the longest
common suffix of two codewords corresponding to nodes u and v in the input trie.
In other words, we would like to compute the longest common prefix of s_T(u)
and s_T(v). This can be found in constant time after a linear time preprocessing
by retrieving the lowest common ancestor of their corresponding nodes in the
suffix tree pQ. 
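
The query itself can be illustrated without the suffix tree by walking towards the root in the codeword trie. The sketch below is a naive stand-in that takes time proportional to the answer, whereas the construction above answers the same query with a single lowest common ancestor lookup. It assumes the textbook LZW parse with the dictionary initialized to the characters occurring in the input; all function names are ours.

```python
def build_lzw_trie(text):
    """LZW-parse text; return (parent, label, phrases) with node 0 as the root."""
    parent, label = [None], [None]
    children = {}  # (node, character) -> child node

    def add(node, ch):
        parent.append(node)
        label.append(ch)
        children[(node, ch)] = len(parent) - 1

    for ch in sorted(set(text)):  # initial single-character codewords
        add(0, ch)
    phrases, node = [], 0
    for ch in text:
        if (node, ch) in children:
            node = children[(node, ch)]  # extend the current codeword
        else:
            phrases.append(node)  # emit the codeword
            add(node, ch)  # new codeword: emitted one plus next character
            node = children[(0, ch)]
    if node != 0:
        phrases.append(node)
    return parent, label, phrases


def codeword(parent, label, v):
    """Spell the codeword of node v (root-to-v labels)."""
    chars = []
    while v != 0:
        chars.append(label[v])
        v = parent[v]
    return "".join(reversed(chars))


def longest_common_suffix(parent, label, u, v):
    """Length of the longest common suffix of the codewords of u and v."""
    length = 0
    while u != 0 and v != 0 and label[u] == label[v]:
        u, v = parent[u], parent[v]
        length += 1
    return length
```

Walking from u and v towards the root reads s_T(u) and s_T(v) from left to right, so the loop computes exactly their longest common prefix.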